HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research
Authors
Abstract
The HKIB (Hankookilbo) test collections are two archives of Korean newswire stories that have been manually categorized using semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories using a semi-hierarchical, balanced three-level classification scheme in which each news story has exactly one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Subsequently, Yonsei University and KISTI collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, rearrange the classification scheme to be fully hierarchical but unbalanced, and assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on Korean text categorization.
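The difference between single-labeling and multi-labeling under a k-NN categorizer can be illustrated with a minimal sketch. The abstract does not specify the term weighting, similarity measure, value of k, or decision rule used in the benchmark, so everything below is an assumption: raw term-frequency vectors, cosine similarity, similarity-weighted voting, and a hypothetical score threshold for the multi-label case. The toy documents and category names are invented for illustration and are not drawn from either collection; real Korean text would first need tokenization or morphological analysis.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_scores(query, training, k=3):
    """Sum the similarities of the k nearest training documents per category."""
    q = Counter(query)
    neighbors = sorted(
        ((cosine(q, Counter(tokens)), labels) for tokens, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, labels in neighbors:
        if sim > 0:  # ignore neighbors that share no terms with the query
            for label in labels:
                scores[label] += sim
    return scores


def classify_single(query, training, k=3):
    """Single-labeling: return the one best-scoring category, or None."""
    scores = knn_scores(query, training, k)
    return max(scores, key=scores.get) if scores else None


def classify_multi(query, training, k=3, threshold=0.5):
    """Multi-labeling: return every category whose score reaches the threshold."""
    scores = knn_scores(query, training, k)
    return sorted(label for label, s in scores.items() if s >= threshold)


# Hypothetical toy data: (tokens, set of categories) pairs.
training = [
    (["stock", "market", "bank"], {"economy"}),
    (["bank", "interest", "rate"], {"economy"}),
    (["soccer", "league", "goal"], {"sports"}),
    (["soccer", "club", "stock"], {"sports", "economy"}),
]
query = ["stock", "bank", "rate"]

single_label = classify_single(query, training, k=3)
multi_labels = classify_multi(query, training, k=3, threshold=0.3)
```

In this sketch the single-label rule always commits to one category, while the multi-label rule can return several (or none) depending on the threshold, which is one simple way to handle documents assigned more than one category.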
Similar resources
Pushing "Underfitting" to the Limit: Learning in Bidimensional Text Categorization
The analysis of two heuristic supervised learning algorithms for text categorization in two dimensions is presented here. The graphical properties of the bidimensional representation allow one to tailor a geometrical heuristic approach in order to exploit the peculiar distribution of text documents. In particular, we want to investigate the theoretical linear cost of the algorithms and try to ...
High rates of glucose utilization in the gas gland of Atlantic cod (Gadus morhua) are supported by GLUT1 and HK1b.
The gas gland of physoclistous fish utilizes glucose to generate lactic acid that leads to the off-loading of oxygen from haemoglobin. This study addresses characteristics of the first two steps in glucose utilization in the gas gland of Atlantic cod (Gadus morhua). Glucose metabolism by isolated gas gland cells was 12- and 170-fold higher, respectively, than that in heart and red blood cells (...
Hangul, the Korean Writing System, and its Computational Treatment
In this paper I will describe how the efficiency of the Hangul writing system can be transferred to its coding on computers. I will also review the encodings of Hangul by the precomposed Hangul code set of KSC 5601, a combinational Hangul code set, and ISO 2022, and their applications to the development of Hangul programs for sorting, editing, and processing Korean text. I then conclude with r...
Arabic Text Categorization Using Classification Rule Mining
Text categorization is one of the known problems in classification data mining. It aims to map text documents into one or more predefined classes or categories based on their keyword contents. This problem has recently attracted many scholars in the data mining and machine learning communities since the online documents that hold useful information for decision makers are numerous...
A language model using variable length tokens for open-vocabulary Hangul text recognition
We propose a novel language model for Hangul text recognition. Without relying on prior linguistic knowledge in training, the proposed model learns variable length Hangul character sequences, which comprise the elementary tokens of the Korean language, and their probabilities from statistics of a raw text corpus. Experiments in handwritten Hangul recognition show that the proposed language model i...
Journal:
- JCSE
Volume 3, Issue
Pages -
Publication date: 2009